Home

Column

Gender Classification via Voice

Jake Whalen

CS 584 Final Project
Fall 2017
Start

Summary

Column

Choosing a Project
  • Topic? Sports, Beer, Other
  • Supervised or unsupervised learning?
  • Data Source: Download, Web Scrape, Social Media
  • Tools: Python, R, Weka, Tableau, Excel

Choice
  • Data from Kaggle
  • Audio Analysis
  • Supervised Learning
  • Classification
  • ML in Python
  • Presentation & Report in R Markdown
  • Excel for results transfer

Goals
  • Classify an audio clip subject's gender
  • Learn which features best separate gender in audio
  • Look for other potential clusters within the data

Method

Column

Exploration

  1. Read the data into R/Python
  2. Ran summary functions on the features
  3. Plotted the data
  4. Looked for patterns and relationships between features
  5. Determined which features separate the genders best
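The exploration steps above can be sketched with pandas. This is an illustrative stand-in: the tiny inline DataFrame mimics the Kaggle voice dataset's schema (`meanfun`, `IQR`, `label` columns) rather than loading the real 3,168-row file.

```python
import pandas as pd

# Tiny stand-in for the Kaggle voice dataset (the real file has
# 3,168 rows and 21 acoustic features plus a "label" column).
df = pd.DataFrame({
    "meanfun": [0.084, 0.108, 0.099, 0.166, 0.185, 0.172],
    "IQR":     [0.075, 0.073, 0.084, 0.121, 0.139, 0.125],
    "label":   ["male", "male", "male", "female", "female", "female"],
})

# Step 2: summary statistics on each feature
print(df.describe())

# Steps 4-5: compare feature distributions across the two classes;
# a large gap between group means suggests a feature separates
# the genders well.
group_means = df.groupby("label")["meanfun"].mean()
print(group_means)
```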

Column

Classification

  1. Used Scikit-learn in Python
  2. Split the data for training/testing (2/3, 1/3)
  3. Used gridsearch to identify the best parameters
  4. KNN (K-Nearest Neighbors)
  5. Decision Tree (DT)
  6. Support Vector Machine (SVM)
  7. Logistic Regression (Log R)
  8. Observed prediction outcomes; there was room for improvement
  9. Attempted to improve on the initial results
  10. KNN: Transform data with PCA
  11. Decision Tree: Use multiple trees with Random Forest
  12. SVM: Transform data with PCA
  13. Log R: Normalized data from 0 to 1 in each feature
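Steps 1-3 above can be sketched as follows. The synthetic `make_classification` data stands in for the 21 Kaggle voice features, and the KNN parameter grid is illustrative rather than the project's full grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 21 acoustic features
X, y = make_classification(n_samples=300, n_features=21, random_state=0)

# Step 2: 2/3 train, 1/3 test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# Step 3: grid search over candidate parameters via cross-validation
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 7, 11], "weights": ["uniform", "distance"]},
    cv=5)
grid.fit(X_train, y_train)
test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_, test_accuracy)
```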

Machine Learning

Column

Review

  • Confusion Matrix
  • Overall Accuracy Scores
  • Male Accuracy
  • Female Accuracy
  • ROC & AUC
  • Parameter Influence
  • Fit & Score Times

Overview

Column

Description

Dataset Comments
  • Database created to identify a voice as male or female, based upon acoustic properties of the voice and speech.
  • The dataset consists of 3,168 recorded voice samples, collected from male and female speakers.
  • The voice samples are pre-processed by acoustic analysis in R using the seewave and tuneR packages, with an analyzed frequency range of 0 Hz-280 Hz (human vocal range).
  • The samples are represented by 21 different features
  • Source: Voice Gender Data

Definitions

Sample

EDA

Column

Classes

Distributions

Boxplots

T Test

Heatmap

Scatter Plot

3D Plot

KNN

Column

K-Nearest Neighbors

Summary
  • Used untransformed data
  • Better than a naive 50/50 classifier
  • Distance weights outperformed Uniform weights
  • P: Manhattan Distance produced better CV results (p=1)
  • Algorithm: auto attempts to decide the most appropriate algorithm based on values
  • Weights: distance weights points by the inverse of their distance; in this case, closer neighbors of a query point have a greater influence than neighbors that are further away
Best Parameters
  • algorithm = auto, n_neighbors = 11, p = 1, weights = distance
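The best parameters above can be plugged into scikit-learn directly; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=1)

# The grid-search winners: Manhattan distance (p=1) and
# inverse-distance weighting of the 11 nearest neighbors
knn = KNeighborsClassifier(n_neighbors=11, p=1,
                           weights="distance", algorithm="auto")
cv_mean = cross_val_score(knn, X, y, cv=5).mean()
print(cv_mean)
```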

Column

Decision Tree

Column

Decision Tree

Summary
  • Used untransformed data
  • MeanFun, sp.ent & IQR account for over 90% of feature importance
  • Presort: presort the data to speed up the finding of best splits in fitting
  • Splitter: The strategy used to choose the split at each node
  • Better at identifying males
  • Easiest model to interpret (follow the branches)
  • Tree
Best Parameters
  • criterion = gini, max_depth = 21, presort = TRUE, splitter = random
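A sketch with the parameters above on synthetic stand-in data. Note `presort` is omitted here because newer scikit-learn releases removed it; the other deck parameters carry over unchanged.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=2)

# Deck's best parameters (presort dropped: removed in newer sklearn)
tree = DecisionTreeClassifier(criterion="gini", max_depth=21,
                              splitter="random", random_state=0)
tree.fit(X, y)

# Importances sum to 1; in the project, meanfun, sp.ent and IQR
# accounted for over 90% of this total.
importance_total = tree.feature_importances_.sum()
print(importance_total)
```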

Column

SVM

Column

Support Vector Machine

Summary
  • Modified the penalty parameter to achieve better results
  • Higher penalties achieved better scores
  • Better at classifying males
Best Parameters
  • C = 48
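A minimal sketch of an SVM with the penalty above, again on synthetic stand-in data; a larger C penalizes margin violations more heavily.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

# C = 48 was the grid-search winner on the untransformed data
svm = SVC(C=48)
svm.fit(X_train, y_train)
acc = svm.score(X_test, y_test)
print(acc)
```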

Column

Log Reg

Column

Logistic Regression

Summary
  • Untransformed data
  • Best Male accuracy
  • Outperformed Log Reg (Normal)
  • C: Inverse of regularization strength
  • fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
  • penalty: Used to specify the norm used in the penalization
Best Parameters
  • C = 0.7, fit_intercept = TRUE, penalty = l1
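A sketch with the parameters above; the `liblinear` solver is one of the solvers in scikit-learn that supports the L1 penalty, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=4)

# Deck's best parameters: C = 0.7 (inverse regularization strength),
# L1 penalty, fitted intercept
logreg = LogisticRegression(C=0.7, penalty="l1", fit_intercept=True,
                            solver="liblinear")
logreg.fit(X, y)
acc = logreg.score(X, y)
print(acc)
```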

Column

Random Forest

Column

Random Forest

Summary
  • Best Female accuracy
  • Took longer to fit than Decision Tree
Best Parameters
  • criterion = entropy, max_depth = 9, n_estimators = 15
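The parameters above translate directly to scikit-learn; a minimal sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=5)

# Deck's best parameters: 15 trees, entropy splits, depth capped at 9
rf = RandomForestClassifier(n_estimators=15, criterion="entropy",
                            max_depth=9, random_state=0)
rf.fit(X, y)

# The forest holds one fitted tree per estimator
print(len(rf.estimators_))
```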

Column

KNN (PCA)

Column

K-Nearest Neighbors (PCA)

Summary
  • Best overall accuracy
  • 9 PCA components used
  • The fewer the neighbors, the better
Best Parameters
  • algorithm = auto, n_neighbors = 3, p = 1, weights = distance
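Chaining the PCA transform and the classifier can be sketched with a pipeline; 9 components and the KNN parameters come from the deck, while the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=6)

# 9 principal components, then the 3 nearest neighbors under
# Manhattan distance (p=1) with inverse-distance weights
model = make_pipeline(
    PCA(n_components=9),
    KNeighborsClassifier(n_neighbors=3, p=1, weights="distance"))
model.fit(X, y)
n_components = model.named_steps["pca"].n_components_
print(n_components)
```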

Column

SVM (PCA)

Column

Support Vector Machine (PCA)

Summary
  • Improvement over SVM on untransformed data
  • Adjusted penalty parameter C of the error term
  • Achieved best performance at a much lower penalty parameter
Best Parameters
  • C = 10
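The same pipeline idea applies here. C = 10 is the deck's winner; the 9-component PCA is an assumption carried over from the KNN (PCA) slide, and the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=7)

# After PCA the grid search settled on a much smaller penalty
# (C = 10 vs. C = 48 on the raw features); 9 components assumed here
model = make_pipeline(PCA(n_components=9), SVC(C=10))
model.fit(X, y)
acc = model.score(X, y)
print(acc)
```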

Column

Log Reg (Normal)

Column

Logistic Regression (Normalized)

Summary
  • Performed worse than Logistic Regression on untransformed data
  • Decrease in performance due to decrease in Male accuracy
  • Slight improvement in Female accuracy compared to first Log R
Best Parameters
  • C = 0.9, fit_intercept = TRUE, penalty = l1
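The 0-to-1 feature normalization above matches scikit-learn's `MinMaxScaler`; a pipeline sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=8)

# MinMaxScaler rescales every feature to [0, 1]; C = 0.9 with an
# L1 penalty was the deck's grid-search winner
model = make_pipeline(
    MinMaxScaler(),
    LogisticRegression(C=0.9, penalty="l1", fit_intercept=True,
                       solver="liblinear"))
model.fit(X, y)
X_scaled = model.named_steps["minmaxscaler"].transform(X)
print(X_scaled.min(), X_scaled.max())
```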

Column

Conclusions

Criteria


Accuracy
  1. KNN (PCA)
  2. Random Forest
  3. Log Regression
Male Accuracy
  1. Log Regression
  2. Log Regression (Normal)
  3. KNN (PCA)
Female Accuracy
  1. Random Forest
  2. KNN (PCA)
  3. Log Regression (Normal)
AUC
  1. Random Forest
  2. Log Regression
  3. Log Regression (Normal)

ROC


Area Under the Curve
  • KNN: 0.8899249
  • Decision Tree: 0.9606488
  • SVM: 0.9611217
  • Log Reg: 0.9961107
  • KNN (PCA): 0.9921023
  • Random Forest: 0.9979454
  • SVM (PCA): 0.9930792
  • Log Reg (Normal): 0.9955755
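AUC scores like those above can be computed with `roc_auc_score`, which needs class probabilities (or decision scores) rather than hard labels. A sketch using the Random Forest on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the voice features
X, y = make_classification(n_samples=300, n_features=21, random_state=9)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

rf = RandomForestClassifier(n_estimators=15, random_state=0)
rf.fit(X_train, y_train)

# Probability of the positive class for each test sample
probs = rf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, probs)
print(auc)
```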

Fitting Times

Scoring Times

Conclusion

Best Model
  • Best Model: Random Forest
  • 2nd highest overall accuracy
  • 1st Female accuracy
  • Highest Area Under the Curve
  • Decent Fitting Time
  • Faster Scoring Time
Improvements
  • Focus on a single method
  • Combine features to create new ones
  • Implement more advanced methods (Bagging/Boosting)
  • Extract features from raw audio files